PDF version of my report: https://github.com/WebsterJiang/PM566-final/raw/main/Report/Report.pdf
Since the beginning of 2020, Covid-19 has affected people's lives globally, and every country has enacted policies to deal with the disease. Rising unemployment, falling GDP, and higher inflation all indicate that the economy is under tremendous risk. For this project, I ask whether differences in people's income influence the number of deaths caused by Covid-19 in the US. I also treat the GDP level during the Covid-19 period as a confounding variable in the analysis.
Variable Description:
State: The 51 states in the US (the 50 states plus the District of Columbia)
State_full_name: Full name of each state
Lon: Longitude
Lat: Latitude
Income: Median household income in the United States
Urban_rural_code: A classification scheme that distinguishes counties by population
Covid_death: Deaths caused by Covid-19
All-Causes death: Deaths from all causes during the analysis period
total_covid_death_instate: Total number of deaths caused by Covid-19 in each state
total_all_death_instate: Total number of deaths from all causes in each state
death_mean_urban: Average number of deaths caused by Covid-19 in each type of county
For the first dataset, I use the median income for each state in the US, provided by the United States Census Bureau: https://www.census.gov/search-results.html?q=Median+income+&page=1&stateGeo=none&searchtype=web&cssp=SERP&_charset_=UTF-8. For the second dataset, I use the counts of Covid-19 deaths and all-cause deaths in each state and county in the US, provided by the CDC: https://data.cdc.gov/NCHS/Provisional-COVID-19-Death-Counts-in-the-United-St/kn79-hsxy. For the third dataset, I use the GDP level of each state in the US, taken from https://worldpopulationreview.com/state-rankings/gdp-by-state.
I first merged the two datasets that contain the main-effect variables, income and deaths caused by Covid-19, by the variable ‘State’ to get a full dataset for further analysis. I then removed the commas that occur in some numeric values (for example, changing 14,500 to 14500) so that R reads them as numbers. Next, I renamed variables whose names include spaces, for example changing “urban rural code” to “urban_rural_code”. Before computing any statistics, the most important step is to check for missing values in the data. For any observation with a missing death count, I replaced the missing value with 0. To summarize the key outcome by state, I created new variables that hold the total number of deaths in each state.
To analyze the confounding variable, I combined the existing data ‘covid1’ with the GDP data to form a new dataset called ‘gdp_incme_covid’. With this combined data, I measure the association between GDP level and Covid-19 deaths and the association between GDP level and income. Because the GDP data are already distinct and clean, the combined dataset needs no further cleaning.
I then created a table to show the details of each key variable. The table contains six variables organized by state: the full name of the state, number of counties, GDP, income, Covid-19 deaths, and all-cause deaths. For the data visualization, I plotted 4 graphs to show the associations between the key variables. For example, I drew a US map to show the density of Covid-19 deaths in each state and a scatter plot to show the linear association between income and the number of Covid-19 deaths.
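The analysis itself was run in R. As a language-neutral illustration of the cleaning steps described above, here is a minimal Python sketch that uses only the standard library. The rows are made-up toy values, not the real Census or CDC data, the raw column names are hypothetical, and the inner join on ‘State’ is an assumption about how the merge was done.

```python
# Toy rows standing in for the Census income file and the CDC county file.
# All values and raw column names are hypothetical, for illustration only.
income_rows = [
    {"State": "AA", "Income": "50,000"},
    {"State": "BB", "Income": "62,500"},
]
covid_rows = [
    {"State": "AA", "urban rural code": "1", "Covid death": "1,200", "All-causes death": "14,500"},
    {"State": "AA", "urban rural code": "4", "Covid death": "", "All-causes death": "310"},
    {"State": "BB", "urban rural code": "2", "Covid death": "85", "All-causes death": "2,040"},
]


def parse_number(text):
    """Remove thousands separators, e.g. '14,500' -> 14500."""
    return int(text.replace(",", ""))


def parse_deaths(text):
    """Parse a death count, replacing a missing value with 0."""
    text = text.strip()
    return parse_number(text) if text else 0


def clean_and_merge(income_rows, covid_rows):
    """Merge by 'State', clean the numbers, rename columns, add state totals."""
    income_by_state = {r["State"]: parse_number(r["Income"]) for r in income_rows}

    merged = []
    for row in covid_rows:
        state = row["State"]
        if state not in income_by_state:
            continue  # assumed inner join: drop rows without an income match
        merged.append({
            "State": state,
            "Income": income_by_state[state],
            # Renamed so that the column name contains no spaces.
            "urban_rural_code": row["urban rural code"],
            "Covid_death": parse_deaths(row["Covid death"]),
            "All_causes_death": parse_deaths(row["All-causes death"]),
        })

    # Sum the county-level counts to get per-state totals.
    covid_totals, all_totals = {}, {}
    for r in merged:
        covid_totals[r["State"]] = covid_totals.get(r["State"], 0) + r["Covid_death"]
        all_totals[r["State"]] = all_totals.get(r["State"], 0) + r["All_causes_death"]
    for r in merged:
        r["total_covid_death_instate"] = covid_totals[r["State"]]
        r["total_all_death_instate"] = all_totals[r["State"]]
    return merged


covid1 = clean_and_merge(income_rows, covid_rows)
```

Replacing missing death counts with 0 keeps every county in the state totals, but it also treats an unreported count as no deaths, which is worth keeping in mind when reading the totals.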
I checked the dimensions of the data and found 3,023 observations with 17 variables each. I then summarized the key variables: income, GDP, Covid-19 deaths, and all-cause deaths. Mississippi has the lowest median income, $45,081, and the District of Columbia has the highest, $86,420. Vermont has the fewest Covid-19 deaths (283) and California has the most (73,920); the mean number of Covid-19 deaths across states is 20,504. Vermont also has the lowest GDP, $33,278 million, and California has the highest, $3,120,386 million.
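A summary like the one above (lowest state, highest state, and mean) can be sketched in a few lines. This is a Python illustration, not the R code used for the report. The function name `summarize` and the state totals below are hypothetical placeholders, not the report's figures.

```python
from statistics import mean

# Hypothetical per-state Covid-19 death totals (placeholder values only).
state_totals = {"AA": 1200, "BB": 85, "CC": 640}


def summarize(values_by_state):
    """Return the states with the lowest and highest values, plus the mean."""
    lowest = min(values_by_state, key=values_by_state.get)
    highest = max(values_by_state, key=values_by_state.get)
    return {
        "lowest": (lowest, values_by_state[lowest]),
        "highest": (highest, values_by_state[highest]),
        "mean": mean(list(values_by_state.values())),
    }


summary = summarize(state_totals)
```

The same function applies unchanged to income or GDP, as long as the values are keyed by state.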